
BMJ Health & Care Informatics

BMJ

Preprints posted in the last 7 days, ranked by how well they match the content profile of BMJ Health & Care Informatics, based on 13 papers previously published here. The average preprint has a 0.03% match score for this journal, so anything above that is already an above-average fit.

1
Design and preliminary safety validation of a hybrid deterministic-AI triage system for multilingual primary healthcare: a WhatsApp-based vignette study in South Africa

Nkosi-Mjadu, B. E.

2026-04-22 health informatics 10.64898/2026.04.21.26349781 medRxiv
Top 0.1%
8.3%

Background: South Africa's public healthcare system serves most of the population through approximately 3,900 primary healthcare clinics characterised by long waiting times and high volumes of repeat-prescription visits. No published pre-arrival digital triage system operates across all 11 official South African languages while aligning with the South African Triage Scale (SATS). This paper reports the design and preliminary safety validation of BIZUSIZO, a hybrid deterministic-AI WhatsApp triage system. Methods: BIZUSIZO delivers SATS-aligned triage via WhatsApp, combining AI-assisted free-text classification (Claude Haiku 4.5) with a Deterministic Clinical Safety Layer (DCSL) that overrides AI output for 53 clinical discriminator categories (14 RED, 19 ORANGE, 20 YELLOW) coded in all 11 official languages and independent of AI availability. A five-domain risk factor assessment can only upgrade triage level. One hundred and twenty clinical vignettes in the patient's language (English, isiZulu, isiXhosa, Afrikaans; 30 per language) were scored against a developer-assigned gold standard with independent blinded nurse review. A 121-vignette multilingual DCSL safety consistency check across all 11 languages and a 220-call post-hoc framing sensitivity evaluation (110 paired vignettes) were also conducted. Results: Under-triage was 3.3% (4/120; 95% CI: 0.9%-8.3%) with no RED under-triage; exact concordance was 80.0% (96/120) and quadratic weighted kappa 0.891 (95% CI: 0.827-0.932). One two-level under-triage was observed on a non-RED presentation (V072, isiXhosa burns vignette, ORANGE→GREEN); one two-level over-triage was observed (V054, isiZulu deep laceration, YELLOW→RED). In the framing sensitivity evaluation, AI-only classification achieved 50.9% RED invariance under adversarial framing; full-pipeline classification achieved 95.0% in four validated languages, with the DCSL rescuing 18 of 23 AI drift cases.
Conclusions: A hybrid deterministic-AI triage system with DCSL-based emergency detection achieved zero RED under-triage and consistent RED detection across all 11 official languages. The 16.7% over-triage rate falls within published South African SATS ranges (13.1-49%). A single two-level under-triage event was observed on an isiXhosa burns vignette (ORANGE→GREEN) and is discussed in Limitations. Findings are preliminary; prospective validation against independent nurse triage is the necessary next step.
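The headline agreement statistic above, quadratic weighted kappa, penalises a two-level triage miss four times as heavily as a one-level miss, because disagreement weights grow with the squared distance between ordinal levels. As an illustrative sketch only, not the authors' code, it can be computed for the four SATS levels as follows:

```python
def quadratic_weighted_kappa(rater_a, rater_b, labels):
    """Quadratic weighted Cohen's kappa for ordinal categories.

    rater_a, rater_b: equal-length sequences of labels.
    labels: the ordered category list (e.g. SATS levels, least to most acute).
    """
    n = len(rater_a)
    k = len(labels)
    idx = {lab: i for i, lab in enumerate(labels)}
    # Observed joint distribution of the two raters' calls.
    obs = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        obs[idx[a]][idx[b]] += 1.0 / n
    # Marginal distributions for each rater.
    pa = [sum(obs[i][j] for j in range(k)) for i in range(k)]
    pb = [sum(obs[i][j] for i in range(k)) for j in range(k)]
    # Quadratic disagreement weights: (i - j)^2 scaled to [0, 1].
    w = [[((i - j) ** 2) / ((k - 1) ** 2) for j in range(k)] for i in range(k)]
    observed = sum(w[i][j] * obs[i][j] for i in range(k) for j in range(k))
    expected = sum(w[i][j] * pa[i] * pb[j] for i in range(k) for j in range(k))
    return 1.0 - observed / expected
```

Perfect agreement yields 1.0, and a two-level discordance (such as the ORANGE to GREEN event on V072) lowers the statistic more than a one-level one.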

2
Large language models and retrieval augmented generation for complex clinical codelists: evaluating performance and assessing failure modes

Matthewman, J.; Denaxas, S.; Langan, S.; Painter, J. L.; Bate, A.

2026-04-24 health informatics 10.64898/2026.04.23.26351098 medRxiv
Top 0.2%
3.5%

Objectives: Large language models (LLMs) have shown promise in creating clinical codelists for research purposes, a time-consuming task requiring expert domain knowledge. Here, we evaluate the performance and assess failure modes of a retrieval augmented generation (RAG) approach to creating clinical codelists for the large and complex medical terminology used by the Clinical Practice Research Datalink (CPRD). Materials & Methods: We set up a RAG system using a database of word embeddings of the medical terminology that we created using a general-purpose word embedding model (gemini-embedding). We developed 7 reference codelists presenting different challenges and tagged required and optional codes. We ran 168 evaluations (7 codelists, 2 different database subsets, 4 models, 3 epochs each). Scoring was based on the omission of required codes and the inclusion of irrelevant codes. We used model-grading (i.e., grading by another LLM with the reference codelists provided as context) to evaluate the output codelists (a score of 0% being all incorrect and 100% being all correct). Results: We saw varying accuracy across models and codelists, with Gemini 3 Pro (score 43%) generally performing better than Claude Sonnet 4.6 (36%) and Gemini 3 Flash, and OpenAI GPT 5.2 performing worst (14%). Models performed better with shorter target codelists (e.g., Eosinophilic esophagitis with four codes, and Hidradenitis suppurativa with 14 codes). Conversely, all models consistently failed to produce a complete Wrist fracture codelist (with 214 required codes). We further present evaluation summaries and failure mode evaluations produced by parsing LLM chat logs. Discussion: Besides demonstrating that a single-shot RAG approach is currently not suitable for codelist generation, we demonstrate failure modes including hallucinations, retrieval failures, and generation failures where retrieved codes are not used. 
Conclusions: Our findings suggest that while RAG systems using current frontier LLMs may create correct clinical codelists in some cases, they still struggle with large and complex terminologies and codelists with a large number of codes. The failure modes we highlight can inform the design of future workflows that avoid them.
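At the core of such a pipeline is embedding retrieval: each terminology entry is embedded once, and the entries nearest to the embedded query are handed to the LLM as context. The sketch below is a hedged illustration with toy two-dimensional vectors and hypothetical code identifiers ('C1' and so on), not the gemini-embedding vectors or CPRD terminology used in the study:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(query_vec, code_index, k=3):
    """Return the k codes whose stored embeddings are closest to the query.

    code_index: mapping from code identifier to its embedding vector.
    """
    ranked = sorted(code_index.items(),
                    key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [code for code, _ in ranked[:k]]
```

A retrieval failure in this framing is simply the required code never appearing in the top-k list; a generation failure is the LLM ignoring a code that was retrieved.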

3
Vision Language Model for Coronary Angiogram Analysis and Report Generation: Development and Evaluation Study

Jiang, Q.; Ke, Y.; Sinisterra, L. G.; Elangovan, K.; Li, Z.; Yeo, K. K.; Jonathan, Y.; Ting, D. S. W.

2026-04-21 cardiovascular medicine 10.64898/2026.04.19.26351241 medRxiv
Top 0.2%
2.9%

Coronary artery disease is a leading cause of morbidity and mortality. Invasive coronary angiography is currently the gold standard in disease diagnosis. Several studies have attempted to use artificial intelligence (AI) to automate its interpretation, with varying levels of success. However, most existing studies cannot generate detailed angiographic reports beyond simple classification or segmentation. This study aims to fine-tune and evaluate the performance of a Vision-Language Model (VLM) in coronary angiogram interpretation and report generation. Using twenty thousand angiogram keyframes from 1,987 patients collated across four unique datasets, we fine-tuned the InternVL2-4B model with Low-Rank Adapter weights to perform stenosis detection, anatomy labelling, and report generation. The fine-tuned VLM achieved a precision of 0.56, recall of 0.64, and F1-score of 0.60 for stenosis detection. In anatomy segmentation, it attained a weighted precision of 0.50, recall of 0.43, and F1-score of 0.46, with higher scores in major vessel segments. Report generation integrating multiple angiographic projection views yielded an accuracy of 0.42, negative predictive value of 0.58 and specificity of 0.52. This study demonstrates the potential of using VLMs to streamline angiogram interpretation to rapidly provide actionable information to guide management, support care in resource-limited settings, and audit the appropriateness of coronary interventions. AUTHOR SUMMARY: Coronary artery disease carries a heavy disease burden worldwide, and coronary angiography is the gold-standard imaging for its diagnosis. Interpreting these complex images and producing clinical reports require significant expertise and time. In this study, we fine-tuned and investigated an open-source VLM, InternVL2-4B, to interpret and report coronary angiogram images in key tasks including stenosis detection, anatomy identification, and full report generation. 
We also benchmarked the fine-tuned InternVL2-4B against a state-of-the-art segmentation model, YOLOv8x, evaluated on the same test sets. We examined how machine learning metrics like the intersection over union score may not fully capture the clinical accuracy of model predictions and discussed the limitations of relying solely on these metrics for evaluating clinical AI systems. Although the model has not yet achieved expert-level interpretation, our results demonstrate the potential and feasibility of automating the reporting of coronary angiograms. Such systems could potentially assist cardiologists by improving reporting efficiency, highlighting lesions that may require review, and enabling automated calculation of clinical scores such as the SYNTAX score.

4
Stakeholder perspectives on the use of enhanced mobile phone capabilities for public health surveillance for non-communicable disease risk factors: A qualitative study

Mwaka, E. S.; Nabukenya, S.; Kasiita, V.; Bagenda, G.; Rutebemberwa, E.; Ali, J.; Gibson, D.

2026-04-23 health informatics 10.64898/2026.04.22.26351443 medRxiv
Top 0.3%
1.9%

Background: Mobile phone-based tools are increasingly used to collect data on non-communicable disease (NCD) risk factors, particularly in low-resource settings where traditional data collection systems face operational and infrastructural constraints. This study examined stakeholder perspectives on the use of enhanced mobile phone-based capabilities to support the collection of public health surveillance data on NCD risk factors in low-resource settings. Methods: An exploratory qualitative study was conducted between November 2022 and July 2023. Twenty in-depth interviews were conducted with public health specialists, ethicists, NCD researchers, health informaticians, and policy makers in Uganda. Thematic analysis was used to interpret the results. Results: Four themes emerged from the data, including benefits of using mobile phone capabilities for NCD risk factor data collection; ethical, legal, and social implications; perceived challenges of using such mobile phone capabilities; and proposed solutions to improve the utility of phone-based capabilities in data collection on NCD risk factors. Participants recognized the potential of mobile technologies to improve data collection efficiency and expand access to hard-to-reach populations. However, concerns emerged regarding inadequate informed consent, risks to privacy and confidentiality, unclear data ownership, and vulnerabilities created by inconsistent enforcement of data protection laws. Social concerns included low digital literacy, unequal access to mobile devices, and fear of stigmatization. Participants emphasized the need for transparent communication, robust data governance, and community engagement. Conclusion: Mobile phone-based systems can strengthen the collection of NCD risk factor data in low-resource settings; however, their benefits depend on addressing key ethical, legal, and social challenges. 
To ensure responsible deployment, digital health initiatives must prioritize participant autonomy, data protection, equity, and trust building. Integrating contextualized ethical, legal, and social considerations into design and policy frameworks will be essential to leveraging mobile technologies in ways that support inclusive and effective NCD prevention and control.

5
Most Instability Phases Resolve: Empirical Evidence for Trajectory Plasticity in Multimorbidity Care from Longitudinal Relational Monitoring

Martin, C. M.; Henderson, I.; Campbell, D.; Stockman, K.

2026-04-24 health informatics 10.64898/2026.04.22.26351537 medRxiv
Top 0.4%
1.7%

Background: The instability-plasticity framework proposes that multimorbidity trajectories periodically enter instability phases that are vulnerable to escalation but also potentially modifiable through relational intervention. Whether such phases commonly resolve without acute care, or predominantly progress to hospitalisation, has not been quantified at scale. Objective: To quantify instability window outcomes across a longitudinal monitoring cohort; to test whether the characteristics distinguishing admitted from resolved windows reflect within-patient trajectory dynamics or between-patient severity; and to characterise which patient-reported and operator-rated signals reliably precede admission, using both a curated pilot sub-cohort and the full monitoring cohort with an explicit cross-cohort comparison. Methods: Two complementary analyses were conducted on data from the MonashWatch Patient Journey Record (PaJR) relational telehealth system. Instability windows were identified algorithmically (>=2 consecutive calls with Total_Alerts >=3) across the full longitudinal dataset (16,383 calls, 244 patients, 2.5 years) and classified by linkage to ED and hospital admission data. Window characteristics were compared at window, patient, and paired within-patient levels. Pre-admission signal cascades were analysed in two configurations: a curated pilot sub-cohort (64 patients, 280 calls, +/-10-day window, 103 admissions, December 2016-September 2017) and the full monitoring cohort (175 patients, 1,180 pre-admission calls, +/-14-day window, December 2016-July 2019). A three-way cross-cohort comparison decomposed differences between the two configurations into pipeline and population effects. Results: 621 instability windows were identified across 157 patients (64% of the monitored cohort). 67.3% resolved without hospital admission or ED attendance, a rate stable across alert thresholds 1-5. 
In paired within-patient analysis (n = 70), duration in days (p = 0.002) and multi-domain breadth (p < 0.001) distinguished admitted from resolved windows; alert intensity did not. In the pilot sub-cohort, patient-reported illness prognosis (Q21) was the dominant pre-admission signal (GEE beta = +0.058, AUC = 0.647, p-BH = 0.018). This finding did not replicate in the full cohort: Q21 was non-significant (GEE beta = -0.008, p = 0.154, AUC = 0.507). Cross-cohort analysis identified selective curation of the pilot sub-cohort as the primary explanation. In the full cohort, six signals escalated significantly before admission after Benjamini-Hochberg correction: total alerts, health impairment (Q26), red alerts, self-rated health (Q3), patient concerns (Q1), and operator concern (Q34). Health impairment achieved the highest individual AUC (0.605) and showed the longest pre-admission lead. No individual signal exceeded AUC 0.61. Conclusions: Two-thirds of instability phases resolve without hospitalisation, providing direct empirical support for trajectory plasticity as a clinically frequent phenomenon. Within the same patient, persistence - in duration and in the consistency of high-severity multi-domain flagging across calls - distinguishes trajectories that tip into admission from those that resolve. The Q21 signal reversal between cohorts illustrates how selective curation can produce compelling but non-replicable findings in monitoring research. In the full population, objective alert signals and operator judgement, rather than patient illness prognosis, carry the pre-admission signal.
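The algorithmic window definition quoted in the Methods (>=2 consecutive calls with Total_Alerts >=3) is straightforward to operationalise. A minimal sketch, assuming each patient's alert counts arrive as an ordered list of calls; this is illustrative only, not the PaJR implementation:

```python
def instability_windows(total_alerts, threshold=3, min_len=2):
    """Identify instability windows in a patient's call sequence.

    A window is a run of at least `min_len` consecutive calls whose alert
    count is >= `threshold`. Returns (start, end) index pairs, inclusive.
    """
    windows, start = [], None
    for i, alerts in enumerate(total_alerts):
        if alerts >= threshold:
            if start is None:
                start = i          # a qualifying run begins here
        else:
            if start is not None and i - start >= min_len:
                windows.append((start, i - 1))
            start = None           # the run (if any) is broken
    # Close a run that extends to the final call.
    if start is not None and len(total_alerts) - start >= min_len:
        windows.append((start, len(total_alerts) - 1))
    return windows
```

Each detected window would then be classified as resolved or admitted by linkage to ED and hospital admission records, as described above.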

6
MIMIC-IV-Phenotype-Atlas (MIPA): A Publicly Available Dataset for EHR Phenotyping

Yamga, E.; Goudrar, R.; Despres, P.

2026-04-24 health informatics 10.64898/2026.04.16.26350888 medRxiv
Top 0.4%
1.7%

Introduction Secondary use of electronic health records (EHRs) often requires transforming raw clinical information into research-grade data. A central step in this process is EHR phenotyping - the identification of patient cohorts defined by specific medical conditions. Although numerous approaches exist, from ICD-based heuristics to supervised learning and large language models (LLMs), the field lacks standardized benchmark datasets, limiting reproducibility and hindering fair comparison across methods. Methods We developed the MIMIC-IV Phenotype Atlas (MIPA) dataset, an adaptation of MIMIC-IV that provides expert-annotated discharge summaries across 16 phenotypes of varying prevalence and complexity. Two independent clinicians reviewed and labeled the discharge summaries, resolving disagreements by consensus. In parallel, we implemented a processing pipeline that extracts multimodal EHR features and generates training, validation, and testing datasets for supervised phenotyping. To illustrate MIPA's utility, we benchmarked four phenotyping methods: ICD-based classifiers, keyword-driven Term Frequency-Inverse Document Frequency (TF-IDF) classifiers, supervised machine learning (ML) models, and LLMs on the task. Results The final MIPA corpus consists of 1,388 expert-annotated discharge summaries. Annotation reliability was high (mean document-level kappa = 0.805, mean label-level kappa = 0.771), with 91% of disagreements resolved through consensus review. MIPA provides high-quality phenotype labels paired with structured EHR features and predefined train/validation/test splits for each phenotype. In the benchmarking case study, LLMs achieved the highest F1 scores in 13 of 16 phenotypes, particularly for conditions requiring contextual interpretation of clinical narrative, while supervised ML offered moderate improvements over rule-based baselines. 
Conclusion MIPA is the first publicly available benchmark dataset dedicated to EHR phenotyping, combining expert-curated annotations, broad phenotype coverage, and a reproducible processing pipeline. By enabling standardized comparison across ICD-based heuristics, ML models, and LLMs, MIPA provides a durable reference resource to advance methodological development in automated phenotyping.

7
DIRD+: A Browser-Based, Offline-First Clinical Platform for Diabetic Retinopathy Screening Using Edge AI Inference in Low-Resource Settings

Baier-Quezada, N.; Almendras, C.; Uribe-Hernandez, V.; Barrientos-Toledo, H.; Leiva-Fernandez, C.; Arrigo-Figueroa, M.; Brana-Pena, F.; Macilla-Leiva, A.; Lopez-Moncada, F.

2026-04-27 health informatics 10.64898/2026.04.26.26351745 medRxiv
Top 0.5%
1.6%

Background: Diabetic retinopathy (DR) is the leading cause of preventable blindness in working-age adults. In Chile, despite GES coverage since 2006, screening reaches only ~21% of the diabetic population under control. Chilean evidence shows that autonomous AI screening platforms have produced heterogeneous field results (sensitivity 40.8-100%, specificity 55.4%), while Ophthalmic Medical Technologists (TMOs) consistently achieve >97% sensitivity, suggesting AI is most effective as structured support for trained professionals rather than as an autonomous filter. Objective: We present DIRD+ (Diabetic Integrated Retinal Diagnosis), an open-source clinical platform that performs complete DR clinical workflows - patient management, AI-assisted lesion detection, clinical classification, annotation, and report generation - entirely within the web browser using WebAssembly-based inference, without transmitting patient data to any server. This work describes the system architecture and a preliminary technical validation. Methods: DIRD+ implements a six-stage inference pipeline using ONNX Runtime Web (v1.23) with SIMD and multi-thread optimizations, a pluggable clinical guideline engine (ICDR 2024, MINSAL Chile 2017), and a human-in-the-loop annotation workflow. A YOLOv26n detection model was trained on 500 pseudo-labeled APTOS 2019 images using the Annotix framework [11] and evaluated on the IDRiD test set (n=81 images). Results: Optic disc detection - the spatial calibration landmark - achieved AP=1.000 on IDRiD (IoU=0.1). Soft exudate detection achieved AP=0.243 (F1=0.364). Internal validation mAP50=0.578. Browser-based inference averaged 0.297 s/image (3.4 images/second) on CPU without GPU. Lesion detection performance reflects a first-generation model trained on 500 images; progressive improvement through collaborative annotation is ongoing. 
Conclusions: DIRD+ demonstrates that a complete offline-first DR clinical workflow can be deployed at zero cost within a standard web browser without server infrastructure or GPU. The pluggable guideline engine and human-in-the-loop architecture make DIRD+ a viable tool for TMO-assisted screening in connectivity-limited primary care settings.

8
Dissecting clinical reasoning failures in frontier artificial intelligence using 10,000 synthetic cases

Auger, S. D.; Varley, J.; Hargovan, M.; Scott, G.

2026-04-23 neurology 10.64898/2026.04.22.26351488 medRxiv
Top 0.5%
1.4%

Background: Current medical large language model (LLM) evaluations largely rely on small collections of cases, whereas rigorous safety testing requires large-scale, diverse, and complex cases with verifiable ground truth. Multiple Sclerosis (MS) provides an ideal evaluation model, with validated diagnostic criteria and numerous paraclinical tests informing differential diagnosis, investigation, and management. Methods: We generated synthetic MS cases with ground-truth labels for diagnosis, localisation, and management. Four frontier LLMs (Gemini 3 Pro/Flash, GPT 5.2/5 mini) were instructed to analyse cases to provide anatomical localisation, differential diagnoses, investigations, and management plans. An automated evaluator compared these outputs to the ground-truth labels. Blinded subspecialty experts validated 70 cases for realism and automated evaluator accuracy. We then evaluated LLM decision-making across 1,000 cases and scaled to 10,000 to characterise rare, catastrophic failures. Results: Subspecialist expert review confirmed 100% synthetic case realism and 99.8% (95% CI 95.5 to 100) automated evaluation accuracy. Across 1,000 generated MS cases, all LLMs successfully included MS in the differential diagnoses for more than 91% of cases. However, diagnostic competence did not associate with treatment safety. Gemini 3 models had low rates of clinically appropriate steroid recommendations (Flash: 7.2%, 95% CI 5.6 to 8.8; Pro: 15.8%, 95% CI 13.6 to 18.1) compared to GPT 5 mini (23.5%, 95% CI 20.8 to 26.1), frequently overlooking contraindications like active infection. OpenAI models inappropriately recommended acute intravenous thrombolysis for MS cases (9.6% GPT 5.2; 6.4% GPT 5 mini) compared to below 1% for Gemini models. Expanded evaluation (to 10,000 cases) probed these errors in detail. 
Thrombolysis was recommended in 10.1% of cases lacking symptom timing information and paradoxically persisted (2.9%) even when symptoms were explicitly documented as more than 14 days old. Conclusion: Automated expert-level evaluation across 10,000 cases characterised artificial intelligence clinical blind spots hitherto invisible to small-scale testing. Massive-scale simulation and automated interrogation should become standard for uncovering serious failures and implementing safety guardrails before clinical deployment exposes patients to risk.

9
A Systematic Exploration of LLM Behavior for EHR phenotyping

Yamga, E.; Murphy, S.; Despres, P.

2026-04-24 health informatics 10.64898/2026.04.16.26350890 medRxiv
Top 0.6%
1.2%

Background Electronic health record (EHR) phenotyping underpins observational research, cohort discovery, and clinical trial screening. Large language models (LLMs) offer new capabilities for extracting phenotypes from unstructured text, but their performance depends on pipeline design choices, including prompting, text segmentation, and aggregation. No systematic framework has previously examined how these parameters shape accuracy and reproducibility. Methods We evaluated LLM-based phenotyping pipelines using 1,388 discharge summaries across 16 clinical phenotypes. A full factorial experiment with LLaMA-3B, 8B, and 70B systematically varied three pipeline components: prompting (zero-shot, few-shot, chain-of-thought, extract-then-phenotype), chunking (none, naive, document-based), and aggregation (any-positive, two-vote, majority), yielding 24 configurations per model. To compare intrinsic model capabilities, biomedical domain-adapted, commercial frontier (LLaMA-405B, GPT-4o, Gemini Flash 2.0), and reasoning-optimized models (DeepSeek-R1) were evaluated under a fixed configuration. Performance was assessed using precision, recall, and macro-F1; secondary analyses examined prediction consistency (Shannon entropy), self-confidence calibration, and the development of a taxonomy of recurrent model errors. Results Factorial ANOVAs showed that chunking and aggregation were the dominant drivers of performance, whereas the prompting strategy contributed minimally. Configuration effects were stable across model sizes, with no significant Model x Parameter interactions. Phenotype difficulty varied substantially (macro-F1 = 0.40-0.90), yet the highest-performing configuration (whole-document inference without aggregation) was consistent across phenotypes, as confirmed by mixed-effects modeling. In cross-model comparisons, DeepSeek-R1 achieved the highest macro-F1 (0.89), while LLaMA-70B matched GPT-4o and LLaMA-405B at substantially lower cost. 
Prediction entropy was low overall and driven primarily by phenotype difficulty rather than prompting or temperature. Self-confidence calibration was only moderately informative: high-confidence predictions were more accurate, but larger models exhibited systematic overconfidence. Conclusions LLM performance in EHR phenotyping is governed primarily by input structure and model capacity, not prompt engineering. Simple, document-level inference yields robust performance across diverse phenotypes, providing practical design guidance for LLM-based cohort identification while underscoring the continued need for human oversight for challenging phenotypes.
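The three aggregation rules in the factorial design (any-positive, two-vote, majority) differ only in how many positive chunk-level calls are required to label the document positive. A hypothetical sketch, assuming boolean per-chunk votes:

```python
def aggregate(chunk_votes, rule="any"):
    """Combine per-chunk phenotype calls (True/False) into one document label.

    rule: 'any' (any-positive), 'two-vote', or 'majority'.
    """
    positives = sum(chunk_votes)
    if rule == "any":        # any-positive: a single positive chunk suffices
        return positives >= 1
    if rule == "two-vote":   # at least two positive chunks required
        return positives >= 2
    if rule == "majority":   # strict majority of chunks must be positive
        return positives > len(chunk_votes) / 2
    raise ValueError(f"unknown rule: {rule}")
```

Under any-positive, a single false-positive chunk flips the whole document, which may help explain why whole-document inference without aggregation performed best in this study.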

10
Decision Curve Analysis for Evaluating Machine Learning Models for Next-Day Transfer Out of ICU

Pozo, M.; Pape, A.; Locke, B.; Pettine, W. W.

2026-04-21 health informatics 10.64898/2026.04.19.26351213 medRxiv
Top 0.6%
1.2%

Timely identification of intensive care unit (ICU) patients likely to exit the unit can support anticipatory workflows such as chart review, eligibility screening, and patient outreach prior to transfer. Most ICU discharge prediction studies report discrimination and calibration, but these metrics do not quantify the decision consequences of acting on predictions. Using adult ICU admissions from MIMIC-IV, we represented each ICU stay as a sequence of daily clinical summaries and trained logistic regression, random forest, and XGBoost models to predict next-day ICU transfer. Models achieved ROC AUC of 0.80-0.84 with differing calibration. We evaluated decision utility using decision curve analysis (DCA), where positive predictions trigger proactive review. Across thresholds, model-guided strategies outperformed review-all, review-none, and a simple clinical rule. To translate net benefit into implementable operations, we modeled a clinical trial recruitment workflow with an 8-hour daily time constraint, incorporating chart review and consent effort. At a feasible operating threshold (0.23), the model flagged ~23 charts/day and yielded ~1.23 enrollments/day under conservative eligibility and consent assumptions. These results demonstrate that DCA provides a transparent framework for determining when ICU transfer predictions are worth using and how thresholds should be selected to align with real-world workflow constraints. Data and Code Availability: This research has been conducted using data from MIMIC-IV. Researchers can request access via PhysioNet. Implementation code is available upon request.
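Decision curve analysis compares strategies by their net benefit at a chosen threshold probability pt, counting each false positive against each true positive at the odds pt/(1-pt). A minimal sketch of the two quantities involved (illustrative only, not the authors' pipeline):

```python
def net_benefit(y_true, y_prob, pt):
    """Net benefit of flagging patients whose predicted probability >= pt.

    NB = TP/N - (FP/N) * pt / (1 - pt), the standard DCA formula.
    """
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= pt and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= pt and y == 0)
    return tp / n - (fp / n) * pt / (1 - pt)

def net_benefit_treat_all(y_true, pt):
    """'Review-all' reference strategy: every patient is flagged."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * pt / (1 - pt)
```

At pt = 0.23, each avoidable chart review is weighted at roughly 0.3 of a captured transfer (0.23/0.77), which is how the threshold encodes the workflow's tolerance for wasted review effort.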

11
Real-time prospective (shadow mode) validation of an AI-based clinical decision support system for predicting 3-month functional outcome in acute stroke: the VALIDATE study protocol

Rubiera, M.; Bendszus, M.; Leker, R. R.; Hilbert, A.; Werren, I.; Lopez-Ramos, L. M.; Ayesta, M.; Nguyen, T. N. Q.; Bonekamp, S.; Sala, V.; Jubran, H.; Meza, C.; Shalabi, F.; Schwartzmann, Y.; Cano, D.; von Tottleben, M.; Kelleher, J.; Frey, D.

2026-04-27 neurology 10.64898/2026.04.26.26350937 medRxiv
Top 0.7%
0.9%

Introduction Despite the proven benefits of reperfusion therapies in acute ischemic stroke, treatment decisions in the hyperacute phase remain complex and are rarely supported by individualized outcome predictions. Artificial intelligence (AI)-based clinical decision support systems (CDSS) offer potential real-time prognostic estimates, but prospective evidence of their feasibility and performance in routine clinical workflows is limited. Our aim is to prospectively evaluate real-time feasibility, usability, and predictive performance of an AI-based CDSS (VALIDATE-CDSS) for individualized outcome prediction in acute stroke care. Methods and analysis Prospective, multicenter, observational study enrolling consecutive patients with acute ischemic stroke presenting to three tertiary stroke centers. Clinical management will follow standard practice at the discretion of treating physicians. In parallel, a dedicated researcher will collect patient data in real time and input them into the VALIDATE-CDSS using a mobile application, operating in shadow mode without influencing clinical decisions. The system will generate individualized predictions of 3-month functional outcome (modified Rankin Scale) for four treatment strategies (intravenous thrombolysis, endovascular thrombectomy, combined therapy, or no reperfusion) at three sequential time points: baseline clinical data, non-contrast CT, and CT angiography. The primary outcome is the real-world feasibility and usability of the VALIDATE-CDSS in the hyperacute stroke workflow. Secondary outcomes include predictive performance, agreement between model-suggested and actual treatments, incremental value with increasing data availability, and assessment of potential bias across predefined subgroups. 
This study will provide prospective real-world evidence on the implementation and clinical potential of AI-based decision support for personalized treatment selection in acute ischemic stroke. Ethics and dissemination: Patient enrollment began after approval from the ethics committees of all participating centers. Results will be disseminated through peer-reviewed open-access journals and conference presentations. Following open science principles, anonymized data and metadata will be made publicly available in the Zenodo repository upon study completion. Trial registration: ClinicalTrials.gov (NCT05622539).

12
Harmonising UK primary care prescription records for research: A case study in the UK Biobank

Ytsma, C. R.; Torralbo, A.; Fitzpatrick, N. K.; Pietzner, M.; Louloudis, I.; Nguyen, D.; Ansarey, S.; Denaxas, S.

2026-04-22 health informatics 10.64898/2026.04.21.26351274 medRxiv
Top 0.8%
0.8%

Objective The aim of this study was to develop and validate an automated, scalable framework to harmonise fragmented UK primary care prescription records into a research-ready dataset by mapping four diverse medical ontologies to a unified, historically comprehensive reference standard. Materials and Methods We used raw prescription records for consented participants in the UK Biobank, in which participants are uniquely characterized by multiple data modalities. Primary care data were preprocessed by selecting one drug code if multiple were recorded, cleaning codes to match reference presentations, expanding code granularity based on drug descriptions, and updating outdated codes to a single reference version. Harmonisation entailed mapping British National Formulary (BNF) and Read2 codes to dm+d, the universal NHS standard vocabulary for uniquely identifying and prescribing medicines. Harmonised dm+d records were then homogenised to a single concept granularity, the Virtual Medicinal Product (VMP). We validated our methods by creating medication profiles mapping contemporary drug prescribing patterns in 312 physical and mental health conditions. Results We preprocessed 57,659,844 records (100%) from 221,868 participants (100%). Of those, 48,950 records were dropped due to lack of drug code. 7,357,572 records (13%) used multiple ontologies. Most (76%) records were encoded in BNF and most had the code granularity expanded via the drug description (N=28,034,282; 49%). 41,244,315 records (72%) were harmonised to dm+d and 99.98% of these were converted to VMP as a homogeneous dataset. Across 312 diseases, we identified 23,352 disease-drug associations with 237 medications (represented as BNF subparagraphs) that survived statistical correction, of which most resembled drug-indication pairs. 
Conclusion Our methodology converts highly fragmented, raw prescription records with inconsistent data quality into a streamlined, enriched dataset at a single reference, version, and granularity of information. Harmonised prescription records can be easily utilised by researchers to perform large-scale analyses.
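The harmonisation flow this abstract describes (BNF/Read2 code → dm+d identifier → VMP concept) can be sketched as a chain of lookup tables. This is a minimal illustration only: the code values, table contents, and record fields below are invented, not the real NHS terminology data.

```python
# Hypothetical mapping tables; real dm+d/BNF/Read2 lookups are far larger.
BNF_TO_DMD = {"0212000B0AAABAB": "dmd:39733211000001103"}    # BNF -> dm+d
READ2_TO_DMD = {"bxd1.": "dmd:39733211000001103"}            # Read2 -> dm+d
DMD_TO_VMP = {"dmd:39733211000001103": "VMP:atorvastatin 20mg tablets"}

def harmonise(record):
    """Map a raw prescription record to a VMP-level concept, or None if unmappable."""
    code, ontology = record["code"], record["ontology"]
    dmd = {"BNF": BNF_TO_DMD, "Read2": READ2_TO_DMD}.get(ontology, {}).get(code)
    if dmd is None:
        return None                    # dropped: no route to dm+d
    return DMD_TO_VMP.get(dmd)         # homogenise to VMP granularity

records = [
    {"code": "0212000B0AAABAB", "ontology": "BNF"},
    {"code": "bxd1.", "ontology": "Read2"},
    {"code": "zzz", "ontology": "BNF"},   # unmappable -> dropped
]
print([harmonise(r) for r in records])
```

Records from different source ontologies collapse to the same VMP concept, which is the property that makes the harmonised dataset homogeneous.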

13
Patterns of maternal transport in a state with levels of maternal care and no formal perinatal regions

Li, J.; Steimle, L. N.; Carrel, M.; Byrd, R. A.; Radke, S. M.

2026-04-22 health systems and quality improvement 10.64898/2026.04.20.26351263 medRxiv
Top 0.8%
0.8%
Show abstract

Purpose To characterize maternal transport patterns in Iowa, a state with levels of maternal care and without formal perinatal regions, and to assess whether transport decisions reflect efficient, risk-appropriate coordination. Methods We analyzed 2010-2023 Iowa birth records, which included 2,251 maternal transports between obstetric facilities across 106 unique routes. We characterized transport patterns and applied a community detection algorithm to identify "communities" of obstetric facilities that disproportionately transport among themselves. Findings Suburban and rural counties have elevated transport rates compared to urban counties. 2,189 transports (97%) were from lower- to higher-level facilities. Among these, 2,037 (93%) were to Level III tertiary care centers. 567 transports (25.2%) bypassed a closer facility offering a level of care equivalent to or higher than the destination facility's. Health system affiliation was associated with bypassing transport, indicating potential organizational rather than purely geographic drivers of transport decisions. Three "communities" of obstetric facilities, largely shaped by geographic proximity, were identified. Conclusions Although Iowa does not have formal perinatal regions, patterns of maternal transport are mostly in line with three de facto regions. Some potential inefficiencies were identified, such as obstetric facilities transporting to a farther facility when a closer facility offered the same level of care or higher. These findings may help identify opportunities to enhance care coordination among obstetric facilities, optimize maternal transport networks, and improve regionalization of maternal care.
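The "bypass" metric in this abstract has a simple operational form: a transport bypasses when some facility closer to the origin offers a level of care at least as high as the actual destination's. A sketch under invented facility levels and distances (the real study uses Iowa birth records, not these values):

```python
# Hypothetical facility levels and road distances (km) from origin "O".
FACILITIES = {"B": {"level": 2}, "C": {"level": 3}, "D": {"level": 2}}
DIST = {"O": {"B": 20, "C": 60, "D": 80}}

def bypassed_closer_facility(origin, dest):
    """True if a closer facility offered care equivalent to or above the destination's level."""
    dest_level = FACILITIES[dest]["level"]
    d_dest = DIST[origin][dest]
    return any(
        d < d_dest and FACILITIES[f]["level"] >= dest_level
        for f, d in DIST[origin].items() if f != dest
    )

print(bypassed_closer_facility("O", "C"))  # closer B is only Level 2 -> False
print(bypassed_closer_facility("O", "D"))  # closer B matches Level 2 -> True
```

Applying this check per transport and aggregating yields the 25.2% bypass rate reported above.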

14
MedSafe-Dx (v0): A Safety-Focused Benchmark for Evaluating LLMs in Clinical Diagnostic Decision Support

Van Oyen, C.; Mirza-Haq, N.

2026-04-21 health informatics 10.64898/2026.04.14.26350711 medRxiv
Top 0.9%
0.8%
Show abstract

MedSafe-Dx (v0) introduces a new safety-focused benchmark for evaluating large language models in clinical diagnostic decision support using a filtered subset of the DDx Plus dataset (N=250). MedSafe-Dx evaluates three dimensions: escalation sensitivity, avoidance of false reassurance, and calibration of uncertainty. Models were tasked with providing a ranked differential (ICD-10), an escalation decision (Urgent vs. Routine), and a confidence flag. Performance was measured via a "Safety Pass Rate," a composite metric penalizing three hard failure modes: missed escalations of life-threatening conditions, overconfident incorrect diagnoses, and unsafe reassurance in ambiguous cases. Evaluation of eleven models revealed a significant disconnect between diagnostic recall and safety. GPT-5.2 achieved the highest Safety Pass Rate (97.6%), while several models exhibited high rates of missed escalations or unsafe reassurance. MedSafe-Dx provides a robust stress test for identifying high-risk failure modes in diagnostic decision support and shows that high diagnostic accuracy does not guarantee clinical safety. While the benchmark is currently limited by synthetic data and proxy labels, it provides a reproducible, auditable framework for testing AI behavior before clinical deployment. Our findings suggest that interventions such as safety-focused prompting and reasoning-token budgets could be essential components for the safe deployment of LLMs in clinical workflows.
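A composite pass/fail metric of this kind is easy to state in code. The sketch below is our interpretation of the abstract's three hard failure modes, not the benchmark's actual specification; field names and failure definitions are assumptions.

```python
def is_safety_pass(case):
    """A case passes only if none of the three hard failure modes occur."""
    missed_escalation = case["life_threatening"] and case["model_triage"] == "Routine"
    overconfident_wrong = (not case["diagnosis_correct"]) and case["confidence"] == "High"
    unsafe_reassurance = (case["ambiguous"] and case["model_triage"] == "Routine"
                          and case["confidence"] == "High")
    return not (missed_escalation or overconfident_wrong or unsafe_reassurance)

def safety_pass_rate(cases):
    return sum(is_safety_pass(c) for c in cases) / len(cases)

cases = [
    {"life_threatening": True, "model_triage": "Urgent", "diagnosis_correct": True,
     "confidence": "High", "ambiguous": False},               # pass
    {"life_threatening": True, "model_triage": "Routine", "diagnosis_correct": True,
     "confidence": "Low", "ambiguous": False},                # missed escalation -> fail
]
print(safety_pass_rate(cases))  # 0.5
```

Because any single hard failure fails the whole case, the metric penalizes unsafe behavior even when the ranked differential is otherwise accurate.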

15
BRIDGE: a barrier-informed Bayesian Risk prediction model for risk IDentification, trajectory Grouping, and profiling of non-adherencE to cardioprotective medicines in primary care

Koh, H. J. W.; Trin, C.; Ademi, Z.; Zomer, E.; Berkovic, D.; Cataldo Miranda, P.; Gibson, B.; Bell, J. S.; Ilomaki, J.; Liew, D.; Reid, C.; Lybrand, S.; Gasevic, D.; Earnest, A.; Gasevic, D.; Talic, S.

2026-04-22 pharmacology and therapeutics 10.64898/2026.04.21.26351387 medRxiv
Top 0.9%
0.7%
Show abstract

Background Non-adherence to lipid-lowering therapy (LLT) affects up to half of patients and contributes substantially to preventable cardiovascular morbidity and mortality. Existing measures, such as the proportion of days covered, provide cross-sectional summaries but fail to capture the dynamic patterns of adherence over time. Although group-based trajectory modelling identifies distinct longitudinal adherence patterns, no approach currently predicts trajectory membership prospectively while incorporating patient-reported barriers. We developed BRIDGE, a barrier-informed Bayesian model to predict adherence trajectories and identify their underlying drivers. Methods BRIDGE incorporates patient-reported barriers as structured prior information within a Bayesian framework for adherence-trajectory prediction. The model was designed not only to estimate which patients are likely to follow different adherence trajectories, but also to generate clinically interpretable probability estimates that help explain why those trajectories may arise and what modifiable factors may be most relevant for intervention. Results BRIDGE achieved a macro AUROC of 0.809 (95% CI 0.806 to 0.813), comparable to random forest (0.815 (95% CI 0.812 to 0.819)) and XGBoost (0.821 (95% CI 0.818 to 0.824)), two widely used machine-learning benchmarks for structured clinical prediction. Calibration was superior to random forest (Brier score 0.530 vs 0.545), and performance was stable across six independent training runs (AUROC SD = 0.003). Incorporating barrier-informed priors improved accuracy by 3.5% and calibration by 5.5% compared to flat priors, showing that incorporation of patient-reported barriers added value beyond electronic medical record data alone. 
Four clinically distinct adherence trajectories were identified: gradual decline associated with treatment deprioritisation amid polypharmacy (10.4%), early discontinuation linked to asymptomatic risk dismissal (40.5%), rapid decline associated with intolerance (28.8%), and persistent adherence (20.2%). Counterfactual analysis identified trajectory-specific intervention levers. Conclusions BRIDGE provides accurate and well-calibrated prediction of adherence trajectories while offering clinically actionable insights into their underlying drivers. By integrating patient-reported barriers with routine clinical data, the model supports targeted, mechanism-informed interventions at the point of prescribing to improve adherence to cardioprotective therapies. Funding MRFF CVD Mission Grant 2017451. Evidence before this study We searched PubMed and Scopus from database inception to December 2025 using the terms "medication adherence", "trajectory", "prediction model", "Bayesian", "lipid-lowering therapy", and "barriers", with no language restrictions. Group-based trajectory modelling has consistently identified three to five adherence patterns across cardiovascular cohorts; however, these applications have been descriptive rather than predictive. Machine-learning models for adherence prediction achieve moderate discrimination but treat adherence as a binary or continuous outcome, thereby overlooking the clinically meaningful heterogeneity captured by trajectory approaches. One prior study applied a Bayesian dynamic linear model to examine adherence-outcome associations, but it did not predict adherence trajectories or incorporate patient-reported barriers. To our knowledge, no published model integrates patient-reported barriers into trajectory prediction. Added value of this study BRIDGE is, to our knowledge, the first model to incorporate patient-reported adherence barriers as hierarchical domain-informed priors within a Bayesian framework for trajectory prediction. 
Using 108 predictors derived from routine electronic medical records, the model achieves discrimination comparable to state-of-the-art machine-learning approaches while additionally providing uncertainty quantification, barrier-level interpretability, and counterfactual insights to inform intervention strategies. The identified trajectories differed not only in adherence level but also in switching behaviour, drug-class evolution, and medication burden, suggesting distinct underlying mechanisms of non-adherence that may require tailored clinical responses. Implications of all the available evidence Each adherence trajectory implies a distinct intervention target: asymptomatic risk communication for early discontinuers (40.5% of patients), proactive tolerability management for rapid decliners, medication simplification for patients with gradual decline associated with polypharmacy, and maintenance support for persistent adherers. By integrating routinely collected clinical data with patient-reported barriers, BRIDGE can be deployed within existing primary care EMR infrastructure to generate actionable, trajectory- and patient-specific recommendations at the point of prescribing, helping to bridge the gap between adherence measurement and targeted adherence management.
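The multiclass Brier score used above to compare calibration (0.530 vs 0.545, lower is better) is the mean squared difference between predicted class probabilities and the one-hot true labels. A minimal sketch with toy probabilities and labels, not BRIDGE's actual outputs:

```python
def brier_multiclass(probs, labels, n_classes):
    """Mean squared error between predicted probability vectors and one-hot truth."""
    total = 0.0
    for p, y in zip(probs, labels):
        onehot = [1.0 if k == y else 0.0 for k in range(n_classes)]
        total += sum((pk - ok) ** 2 for pk, ok in zip(p, onehot))
    return total / len(labels)

probs = [[0.7, 0.2, 0.1],   # confident and correct -> small penalty
         [0.2, 0.5, 0.3]]   # hesitant and correct -> larger penalty
labels = [0, 1]
print(round(brier_multiclass(probs, labels, 3), 2))  # 0.26
```

Unlike AUROC, this score rewards probabilities that match observed frequencies, which is why it is the natural companion metric when a model's probability estimates are meant to be clinically interpretable.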

16
Development of Explainable Machine Learning Framework for Early Detection and Risk Stratification of Diabetes in Age Specific Variations

Lukhele, N.; Mostafa, F.

2026-04-27 health informatics 10.64898/2026.04.25.26351733 medRxiv
Top 0.9%
0.7%
Show abstract

Objective To develop and evaluate a novel machine learning (ML) framework tailored to a clinical diabetes dataset and to assess whether demographic stratification enhances model performance and interpretability for multiclass diabetes classification. Methods A clinical dataset of 264 patient records was used to classify individuals into non-diabetic, prediabetic, and diabetic categories. Several supervised learning models were trained using an 80:20 train-test split and optimized using RandomizedSearchCV and 10-fold cross-validation. Model performance was evaluated using accuracy, precision, recall, and F1-score. Area under the receiver operating characteristic curve (AUC) was calculated for the best-generalizing model. A structured ML framework was developed for this dataset, incorporating preprocessing, model optimization, and stratified analysis by age (<35 vs >35 years) and gender. SHAP was applied for model interpretability. Results Ensemble methods demonstrated superior performance compared to linear or single-tree approaches, with Gradient Boosting showing the most stable generalization: a test accuracy of 0.981 and stable cross-validation accuracy of 0.972. AUC-ROC analysis using Gradient Boosting yielded good discriminative ability across the three diabetes classes: 0.991 (non-diabetic), 0.986 (prediabetic) and 0.972 (diabetic). Stratified analysis showed improved reliability in individuals aged >35 years (accuracy = 0.94, F1-score = 0.92), while performance in younger individuals was unstable due to small sample size. SHAP analysis identified HbA1c, BMI, and age as dominant predictors. Conclusion This study presents an ML framework integrating age-stratified modelling with explainable ML methods to improve interpretability. 
The findings offer clinically relevant results that can support clinical decision-making systems, individualized risk assessment, and potential applications for targeted intervention in diabetes progression.
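The age-stratified evaluation this abstract describes reduces to computing metrics separately per subgroup. A sketch with invented records (the study's 264-patient dataset and its fields are not reproduced here):

```python
def stratified_accuracy(records, threshold=35):
    """Accuracy per age subgroup; None for an empty subgroup (the instability
    noted above for younger individuals stems from exactly such small groups)."""
    groups = {"<35": [], ">35": []}
    for r in records:
        key = "<35" if r["age"] < threshold else ">35"
        groups[key].append(r["pred"] == r["true"])
    return {k: (sum(v) / len(v) if v else None) for k, v in groups.items()}

records = [
    {"age": 28, "pred": "prediabetic", "true": "prediabetic"},
    {"age": 61, "pred": "diabetic",    "true": "diabetic"},
    {"age": 44, "pred": "diabetic",    "true": "prediabetic"},
]
print(stratified_accuracy(records))  # {'<35': 1.0, '>35': 0.5}
```

With only a handful of cases in a stratum, one misclassification swings the estimate by a large margin, which is the small-sample instability the authors report for the under-35 group.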

17
A profile analysis of peripherally inserted central catheters implanted over 10 years in a quaternary hospital

da Luz, C. C.; Sorbello, C. C. J.; Epifanio, E. A.; dos Santos, C. d. A.; Brandi, S.; Guerra, J. C. d. C.; Wolosker, N.

2026-04-23 health systems and quality improvement 10.64898/2026.04.22.26351492 medRxiv
Top 1%
0.7%
Show abstract

Background: Vascular access is essential in treating patients undergoing prolonged endovenous therapy such as chemotherapy, antibiotics, and parenteral nutrition. Since the 1990s, when PICCs (peripherally inserted central catheters) appeared, vascular access options have expanded significantly, revolutionizing the treatment landscape for all types of patients. Objective: To analyze and describe the profile of PICC use in a Brazilian quaternary hospital over 10 years, with data collected by the infusion therapy team, evaluating the number of PICCs implanted over the years, patient epidemiology and clinical characteristics, insertion details, associated complications, and the reason for removal. Methods: A retrospective cohort study employing a quantitative, non-experimental approach to classify and statistically analyze past events associated with 21,652 PICCs implanted from January 2012 to December 2021 in a quaternary hospital in São Paulo, Brazil. All the catheters were implanted, and the data collected, by a team of nurses specializing in infusion therapy. We analyzed the number of catheters implanted over the years, insertion characteristics, patient epidemiology and clinical data, possible associated complications, and the reason for removal. Statistical analyses were conducted using R software (version 4.4.1) and SPSS (version 29) for Windows (IBM Corp, Armonk, NY). Results: During the specified period, 21,652 catheters were analyzed. The patients' gender distribution was nearly balanced (48.2% versus 51.8%), and the average age was 66 years. Cardiovascular and metabolic issues were the most common comorbidities, and between 2020 and 2021, 29.3% of the sample tested positive for COVID-19. The most common location of hospitalization and implantation was the medical-surgical clinic (31.6% - 41.4%), and the most used type of catheter was the Power Picc (83.9%). 
The estimated complication incidence density is 2.94 complications per 1,000 catheter-days. Almost all the PICCs (98.2%) were adequately located at the cavo-atrial junction after the first attempt, 82.2% of catheters were removed after completion of therapy, and the median duration of catheter use was 12 days. Conclusion: PICCs are widely employed for drug infusion, with their use growing progressively due to the greater availability and training of specialized teams. The high efficiency of these devices, with a relatively low risk of complications as observed in previous studies, was reinforced by the findings of this study of more than 20,000 catheters.
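Incidence density of the kind quoted above is a simple rate: complications divided by total catheter-days, scaled per 1,000. The counts below are hypothetical, chosen only to illustrate the arithmetic, and are not the study's raw numbers:

```python
def incidence_density(n_complications, catheter_days, per=1000):
    """Events per `per` catheter-days of exposure."""
    return n_complications * per / catheter_days

# Hypothetical: 882 complications over 300,000 catheter-days -> 2.94 per 1,000.
print(round(incidence_density(882, 300_000), 2))  # 2.94
```

Normalising by catheter-days rather than by catheter count lets cohorts with very different dwell times be compared on the same scale.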

18
Operationalisation of the African Medicines Agency: Retrospective evaluation of the continental centralised pilot procedure - timelines to recommendation and national registration decisions

Ismail, A. J.; Moeti, L.; Darko, D. M.; Walker, S.; Salek, S.

2026-04-24 health systems and quality improvement 10.64898/2026.04.22.26351547 medRxiv
Top 1%
0.5%
Show abstract

Background Regulatory inconsistency across African countries contributes to duplicative scientific assessments, prolonged approval timelines, and delayed access to essential medical products. To inform the operationalisation of the African Medicines Agency (AMA), the African Medicines Regulatory Harmonisation (AMRH) programme implemented Africa's first continental pilot study for the scientific evaluation and listing of human medicinal products. This study evaluates the pilot's procedural performance and examines how continental scientific opinions were translated into national regulatory decisions through reliance mechanisms. Methods and Findings A mixed-methods programme evaluation was conducted using regulatory datasets generated during the pilot study. Quantitative data included assessment timelines, GMP inspection outcomes and national post-listing regulatory actions. Retrospective qualitative thematic analysis was applied to governance documents and National Regulatory Authority (NRA) feedback to identify legal, institutional and procedural determinants influencing uptake. Of 64 expressions of interest, 24 products progressed to full evaluation and 12 received positive continental scientific opinions. Ten met the predefined performance target of ≤210 working days. Twenty-four GMP inspections identified no critical deficiencies and aligned with global regulatory benchmarks. National uptake demonstrated active reliance: full reliance (continental opinion as primary basis for national approval) for 7 products (58%); sequential reliance (continental assessment supplemented with targeted national queries) for 3 products (25%); and supplemented national review (separate national assessment undertaken) for 2 products (17%). Products with broader market strategies achieved registration in up to 23 African countries within a median of 77 working days post-listing. 
Variability in uptake reflected national legal authority, administrative requirements, and applicant submission strategies. Conclusions The pilot study demonstrates the feasibility of a continent-wide regulatory assessment mechanism capable of producing trusted scientific outputs and enabling reliance-based national decision-making in Africa. While reliance was widely applied, heterogeneity in national procedures and administrative sequencing affected time to national registration. Findings provide empirical evidence to inform the AMA scale-up, highlighting the need for harmonised reliance pathways, streamlined administrative processes, and coordinated digital regulatory infrastructure.

19
"Isn't social prescribing what social workers have been doing forever?": UK social worker perspectives on social prescribing and professional boundaries

White, C.; Price, E.; Walker, L.; Bell, J.; Revell, L.

2026-04-27 primary care research 10.64898/2026.04.24.26351583 medRxiv
Top 1%
0.5%
Show abstract

Social prescribing has assumed increasing dominance in policy and practice internationally, including in the UK, where it has an increasing role in addressing social needs such as isolation, and social determinants of ill health. Although General Practitioners are perceived as key referral sources, social workers in one locality were found to play a significant role in referral. This suggests that the social work role in this context has been under-recognised and under-explored. This study sought to explore social workers' perceptions and experiences of social prescribing through an online survey conducted from January to June 2022. All UK social workers were eligible to participate, regardless of whether they had made referrals. A total of 105 responses were collected from all UK nations. Data were analysed using inductive thematic analysis. Four key themes were generated: contended and contested boundaries; complementary spaces; delineated spaces of simplicity and complexity; social work under threat. Participants recognised that social prescribing could provide valuable client support and could be a useful resource for social workers. However, they also expressed concerns about overlapping professional boundaries and the potential for social prescribing to encroach on social work, perceiving it as most appropriate for the delivery of support to those with 'low level' needs.

20
Multi-Hospital Electronic Health Record Foundation Models Without Data Sharing: A Comparison of Federated Learning and Inference-Time Ensembling

Elemento, O.

2026-04-27 health informatics 10.64898/2026.04.24.26351702 medRxiv
Top 1%
0.5%
Show abstract

Background. Foundation models for electronic health records (EHRs) perform strongly on clinical prediction, but every published model has been trained within a single health system. No multi-institutional EHR foundation model currently exists, largely because privacy regulations and governance barriers block data pooling across hospitals. Two strategies could build such models without pooling: federated learning (exchanges model weights) and inference-time ensembling (exchanges only predictions at query time). Whether either is viable for autoregressive EHR foundation models, and whether individual hospitals benefit from participating, is not established. Methods. We trained a generative pretrained transformer (GPT) style EHR foundation model on 100,163 Medical Information Mart for Intensive Care (MIMIC-IV) patients, partitioned into five heterogeneously distributed (non-IID) sites by Dirichlet allocation over International Classification of Diseases (ICD) chapters. We compared centralized training, federated averaging, and inference-time ensembling, and each hospital's solo model against the ensemble including it. Models were evaluated on 15,012 held-out patients using per-condition area under the receiver operating characteristic curve (AUROC) for five acute conditions and micro-averaged area under the precision-recall curve (AUPRC) across 2,590 diagnoses. Results. Centralized training achieved per-condition AUROC 0.75-0.85 and overall AUPRC 0.376. Federated averaging recovered 85% of centralized AUPRC (0.321) and 98-100% of per-condition AUROC. Inference-time ensembling, requiring no training-time exchange, recovered 77% of AUPRC (0.291) and 97-99% of per-condition AUROC. An estimated 87% of participating hospitals received a better model from the ensemble than from training alone; only hospitals with ~40% of the network's patients matched the ensemble on their own. FedProx collapsed to the marginal baseline. Conclusions. 
Multi-institutional EHR foundation models can be built without pooling patient data. Inference-time ensembling benefits most participating hospitals and imposes the lightest governance burden; federated learning recovers more performance but requires weight sharing. These findings offer a practical path toward collaborative clinical AI.
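The two collaboration strategies contrasted above differ in what crosses institutional boundaries: federated averaging exchanges model weights, inference-time ensembling exchanges only predictions. A schematic sketch on toy linear models (a deliberate simplification of the paper's GPT-style setting; all numbers are invented):

```python
def fed_avg(site_weights, site_sizes):
    """One FedAvg round: average parameters weighted by each site's patient count."""
    total = sum(site_sizes)
    return [
        sum(w[i] * n for w, n in zip(site_weights, site_sizes)) / total
        for i in range(len(site_weights[0]))
    ]

def ensemble_predict(site_models, x):
    """Inference-time ensembling: average per-site predictions; no weights shared."""
    preds = [sum(wi * xi for wi, xi in zip(w, x)) for w in site_models]
    return sum(preds) / len(preds)

# Two sites with toy 2-parameter linear models and unequal cohort sizes.
sites = [[1.0, 2.0], [3.0, 4.0]]
print(fed_avg(sites, [100, 300]))           # [2.5, 3.5] (weighted toward site 2)
print(ensemble_predict(sites, [1.0, 1.0]))  # 5.0
```

The governance trade-off is visible in the signatures: `fed_avg` requires every site to ship its parameters each round, while `ensemble_predict` only needs a prediction per query, which is why the abstract describes ensembling as the lighter-governance option at some cost in recovered performance.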